Goto

Collaborating Authors

 pu learning


Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

Zhang, Heng, Zhang, Chengzhi

arXiv.org Artificial Intelligence

The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their document locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.


Applications of Positive Unlabeled (PU) and Negative Unlabeled (NU) Learning in Cybersecurity

Dilworth, Robert, Gudla, Charan

arXiv.org Artificial Intelligence

This paper explores the relatively underexplored application of Positive Unlabeled (PU) Learning and Negative Unlabeled (NU) Learning in the cybersecurity domain. While these semi-supervised learning methods have been applied successfully in fields like medicine and marketing, their potential in cybersecurity remains largely untapped. The paper identifies key areas of cybersecurity--such as intrusion detection, vulnerability management, malware detection, and threat intelligence--where PU/NU learning can offer significant improvements, particularly in scenarios with imbalanced or limited labeled data. We provide a detailed problem formulation for each subfield, supported by mathematical reasoning, and highlight the specific challenges and research gaps in scaling these methods to real-time systems, addressing class imbalance, and adapting to evolving threats. Finally, we propose future directions to advance the integration of PU/NU learning in cybersecurity, offering solutions that can better detect, manage, and mitigate emerging cyber threats.


Harnessing PU Learning for Enhanced Cloud-based DDoS Detection: A Comparative Analysis

Dilworth, Robert, Gudla, Charan

arXiv.org Artificial Intelligence

This paper explores the application of Positive-Unlabeled (PU) learning for enhanced Distributed Denial-of-Service (DDoS) detection in cloud environments. Utilizing the $\texttt{BCCC-cPacket-Cloud-DDoS-2024}$ dataset, we implement PU learning with four machine learning algorithms: XGBoost, Random Forest, Support Vector Machine, and Na\"{i}ve Bayes. Our results demonstrate the superior performance of ensemble methods, with XGBoost and Random Forest achieving $F_{1}$ scores exceeding 98%. We quantify the efficacy of each approach using metrics including $F_{1}$ score, ROC AUC, Recall, and Precision. This study bridges the gap between PU learning and cloud-based anomaly detection, providing a foundation for addressing Context-Aware DDoS Detection in multi-cloud environments. Our findings highlight the potential of PU learning in scenarios with limited labeled data, offering valuable insights for developing more robust and adaptive cloud security mechanisms.


Learning A Disentangling Representation For PU Learning

Zamzam, Omar, Akrami, Haleh, Soltanolkotabi, Mahdi, Leahy, Richard

arXiv.org Artificial Intelligence

In this paper, we address the problem of learning a binary (positive vs. negative) classifier given Positive and Unlabeled data commonly referred to as PU learning. Although rudimentary techniques like clustering, out-of-distribution detection, or positive density estimation can be used to solve the problem in low-dimensional settings, their efficacy progressively deteriorates with higher dimensions due to the increasing complexities in the data distribution. In this paper we propose to learn a neural network-based data representation using a loss function that can be used to project the unlabeled data into two (positive and negative) clusters that can be easily identified using simple clustering techniques, effectively emulating the phenomenon observed in low-dimensional settings. We adopt a vector quantization technique for the learned representations to amplify the separation between the learned unlabeled data clusters. We conduct experiments on simulated PU data that demonstrate the improved performance of our proposed method compared to the current state-of-the-art approaches. We also provide some theoretical justification for our two cluster-based approach and our algorithmic choices.


Instance Selection and Instance Weighting for Cross-Domain Sentiment Classification via PU Learning

Xia, Rui (Nanjing University of Science and Technology) | Hu, Xuelei (Nanjing University of Science and Technology) | Lu, Jianfeng (Nanjing University of Science and Technology) | Yang, Jian (Nanjing University of Science and Technology) | Zong, Chengqing (National Laboratory of Pattern Recognition, Institute of Automation)

AAAI Conferences

Due to the explosive growth of the Internet online reviews, we can easily collect a large amount of labeled reviews from different domains. But only some of them are beneficial for training a desired target-domain sentiment classifier. Therefore, it is important for us to identify those samples that are the most relevant to the target domain and use them as training data. To address this problem, a novel approach, based on instance selection and instance weighting via PU learning, is proposed. PU learning is used at first to learn an in-target-domain selector, which assigns an in-target-domain probability to each sample in the training set. For instance selection, the samples with higher in-target-domain probability are used as training data; For instance weighting, the calibrated in-target-domain probabilities are used as sampling weights for training an instance-weighted naive Bayes model, based on the principle of maximum weighted likelihood estimation. The experimental results prove the necessity and effectiveness of the approach, especially when the size of training data is large. It is also proved that the larger the Kullback-Leibler divergence between the training and test data is, the more effective the proposed approach will be.